Coping with Sparse Inputs on Enhanced Meshes - Semigroup Computation with COMMON CRCW Buses
Author
Abstract
Consider a √n × √n processor mesh where, in addition to the local links, each row and column is enhanced by a COMMON CRCW bus. Assume that each processor stores an element of a commutative semigroup, and that only k < n entries (in arbitrary positions) are nonzero. We wish to compute the sum of all entries. For this problem we easily obtain a lower time bound of Ω(√k) if k ≤ n^(1/3). Our main result is an O(√k log k) time algorithm. It requires a composition of several data movement and compaction techniques which seem to be of general use for solving problems with sparse inputs scattered on the mesh, as is typical, e.g., for primal sketches in digital image processing.

1: Model, Motivation, and Problem

The mesh-connected computer with row and column buses (other names in the literature: mesh or processor array with multiple broadcasting, enhanced mesh) has received some attention as an architecture for parallel processing, especially suitable for digital geometry problems. It consists of a √n × √n grid of processors. Each processor is connected to its (at most four) neighbors by local links. Additionally, each row and column is equipped with a bus for long-distance communication. Each processor has local memory (of usually O(log n) bits) and applies in each time unit a global instruction to its own data. A processor can perform local computations and send and receive messages through the local links and buses. All processors in a row or column, respectively, can simultaneously read a message from their bus, but for technical reasons only one message per step can be broadcast on a bus. If only one processor per step is allowed to send a message, we speak of CREW buses, according to the terminology for PRAMs.

Since the first treatment in [9], many complexity results have been obtained for this architecture. Usually one assumes that the input has length n and is pretiled onto the whole mesh, or that it has length k < n and stands initially in k prescribed processors [2]. The mesh is often considered a natural device for geometric problems on digital images ([9] [10] [3] and others): every processor represents a pixel or a square of pixels and stores a color, in the simplest case only black or white. But the known mesh algorithms for geometric problems have complexities depending on the number n of processors, even for sparse digital images with k ≪ n black pixels. On the other hand, in "continuous" computational geometry complexities are stated as functions of the number of given geometric features (points, line segments, etc.). This discrepancy provokes the following metaproblem: Can we solve the fundamental computational problems for sparse inputs on the mesh in time depending only on the number k of given features, regardless of the size n of the mesh? Note that the k items now stand in arbitrary positions. In [5] we give a general negative answer if CREW buses are presumed: Just the problem of informing some target processor about the coordinates of k items has a complexity of Θ(√n), even if k = 2. (The case k = 1 is trivially solvable in constant time.) The result still holds if the processors have unbounded power and memory. The proof is a nontrivial adversary argument using only the prohibition of concurrent writing. So what happens if we allow concurrent write access to the buses? Since a bus can carry only one message at a time, we must demand that concurrently writing processors write the same message. Therefore we speak of COMMON CRCW buses.
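To make the write rule concrete, here is a minimal Python sketch of a single time step on one bus. It is our own illustration, not part of the paper; the function name and data layout are hypothetical. It merely enforces the COMMON convention: any number of concurrent writers is allowed, provided they all write the same message.

```python
# A minimal, illustrative simulation of one COMMON CRCW bus step.
# Hypothetical names; the paper treats buses abstractly. The only rule
# encoded here is that concurrent writers on one bus must agree.

def bus_step(writes):
    """writes: list of (processor_id, message) pairs attempted on ONE bus.
    Returns the message carried by the bus this step, or None if idle.
    Raises an error when the COMMON write rule is violated."""
    if not writes:
        return None  # nobody broadcasts; the bus stays silent
    messages = {m for (_, m) in writes}
    if len(messages) > 1:
        # CREW would forbid len(writes) > 1 outright; COMMON CRCW
        # merely requires that all concurrent writers write the same.
        raise ValueError("COMMON rule violated: conflicting bus writes")
    return messages.pop()  # every processor on the row/column reads this

# Example: three processors broadcast the same "item present" flag.
assert bus_step([(3, "item"), (7, "item"), (9, "item")]) == "item"
```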
Such modifications of the mesh model were already suggested in early papers, e.g. [9]. But for problems with dense input this seems to be irrelevant: many provably optimal mesh algorithms require only CREW buses. The aim of this paper is to demonstrate that the mesh with COMMON CRCW buses can quickly solve problems with sparse inputs, in the above sense.

As a prototype problem we study semigroup computations. Let (S, +) be a fixed commutative semigroup, w.l.o.g. containing a zero element. Examples are the sum, product, or maximum of numbers; for short we will always speak of "sums" throughout the paper. Given one element from S in each processor, we wish to compute the sum of all these entries. This problem is well understood on enhanced meshes. As shown in [9], the time complexity is Θ(n^(1/6)). Somewhat surprisingly, suitable non-square meshes can perform semigroup computation even in Θ(n^(1/8)) time [1]. An optimal time bound for semigroup computation with k nonzero entries has been derived in [2], but there the relevant items are assumed to stand initially in prescribed processors.

In the present paper we consider the semigroup problem with k nonzero entries in k unknown processors. Even the number k may be unknown in advance. With COMMON CRCW buses we will obtain time bounds of Ω(√k) for k ≤ n^(1/3), and O(√k log k). As an illustrative example of the relevance of sparse semigroup computation, consider the computation of the centroid of k given points in the digital plane. Moreover, semigroup computation is a basic problem occurring as a subproblem in many contexts. Whereas the earlier semigroup algorithms for items in prescribed processors [9] [1] [2] are of quite simple regular structure, we must now combine the data movement techniques in a more refined way. Additionally (and most importantly) we need a result from [11] concerning the parallel compaction of scattered information. We are convinced that the introduced approach is also applicable to other basic computational problems, not only semigroup computation. To obtain a first insight we chose the semigroup problem as our subject of study because of its particular simplicity.

We conclude this section with some terminology. We presume that the reader is familiar with the PRAM models, see e.g. [7]. Throughout the paper, the term mesh means the √n × √n processor array with local links and a COMMON CRCW bus in each row and column. Every processor knows its coordinates (i, j) (1 ≤ i, j ≤ √n), where i and j are the numbers of the row and column, respectively, in which the processor is located. A submesh is the set of all processors (i, j) satisfying a ≤ i ≤ b and c ≤ j ≤ d, for arbitrarily given a, b, c, d. Let i_1 < … < i_r and j_1 < … < j_s be arbitrarily chosen row and column numbers. A subgrid is the set of all processors (i, j) such that i = i_μ and j = j_ν for some 1 ≤ μ ≤ r and 1 ≤ ν ≤ s. So every submesh is a subgrid, but not vice versa. With respect to a fixed subgrid, (μ, ν) are called the relative coordinates of (i, j). A subgrid has width w if w is smaller than all differences i_(μ+1) − i_μ and j_(ν+1) − j_ν. Consider a fixed instance of the semigroup problem. An item is a nonzero entry, and k denotes the number of items. A subgrid is called occupied if it includes all items of our instance. So the minimal occupied subgrid contains at least one item in each of its rows and columns, and no item lies outside it. Two instances with equal sums are called equivalent.
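As a plain sequential reference for these definitions (our own sketch, not the paper's parallel method), the following Python function computes the occupied rows and columns and the relative coordinates of all items; the name minimal_occupied_subgrid and the 0-based grid layout are hypothetical.

```python
# Sequential reference code for the terminology above: given a grid over
# a semigroup with 0 meaning "no item", compute the minimal occupied
# subgrid and each item's relative coordinates.

def minimal_occupied_subgrid(grid):
    """Returns (rows, cols, rel): rows/cols are the sorted occupied
    row/column numbers, rel maps (i, j) -> relative coordinates."""
    items = [(i, j) for i, row in enumerate(grid)
                    for j, x in enumerate(row) if x != 0]
    rows = sorted({i for i, _ in items})   # i_1 < ... < i_r
    cols = sorted({j for _, j in items})   # j_1 < ... < j_s
    row_rank = {i: mu for mu, i in enumerate(rows, start=1)}
    col_rank = {j: nu for nu, j in enumerate(cols, start=1)}
    rel = {(i, j): (row_rank[i], col_rank[j]) for (i, j) in items}
    return rows, cols, rel

grid = [[0, 0, 0, 0],
        [0, 5, 0, 2],
        [0, 0, 0, 0],
        [0, 0, 0, 7]]
rows, cols, rel = minimal_occupied_subgrid(grid)
assert rows == [1, 3] and cols == [1, 3]
assert rel[(3, 3)] == (2, 2)   # relative coordinates within the subgrid
```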
The general semigroup problem is sensitive, i.e. for an arbitrary partial input it is impossible to compute the sum without knowing the last missing input value. But note that this is not valid in all special semigroups; consider e.g. (S, max) with a finite set S.

2: The Complexity of Sparse Semigroup Computation

For proving a simple lower bound we adapt the well-known information flow argument (see e.g. [1]) to our problem.

Theorem 1. Any algorithm for general semigroup computation on the mesh requires Ω(√k) time if k ≤ n^(1/3).

Proof. Consider instances with k items in a √k × √k subgrid of width √k. Obviously, in the first √k/2 steps no processor can become aware of more than one item by local communication only. In the following we pretend that, whenever some processor writes onto some bus, all processors in the mesh (not only along the bus) receive this message "for free". This concession can only reduce the lower bound. An item written onto some bus in step t is called new if it has not been broadcast by buses before step t. Clearly, while t < √k/2 only O(√k) buses can broadcast new items. Each of these buses can carry only one semigroup element (i.e. an original item or a partial sum) per step. Furthermore, every new item is an original item of the given instance, rather than a partial sum. Hence any fixed processor has received O(√k · t) items after t steps. Due to sensitivity, a processor must know all k items in order to compute the sum. This yields k = O(√k · t), and the assertion follows. ∎

Note that for larger k we obtain the well-known Ω(n^(1/6)) bound. The proof applies to any sensitive problem.

Now we wish to derive a possibly tight asymptotic upper bound. One principal problem is to quickly find the k items scattered somewhere on the mesh. In [11], P. Ragde provided an algorithm for ordered compaction on the COMMON CRCW PRAM with m memory cells and p ≥ m processors. Based on perfect hashing, it moves q items from arbitrary memory cells into the first q cells, preserving the ordering of the items, in O(log q / log log p) time, and there is no need to know q in advance. We shall simulate this algorithm on the mesh and, with its help, determine the minimal occupied subgrid for our given instance, i.e. we inform each processor about its relative coordinates. Since the at most k × k processors in our occupied subgrid are then addressable by their relative coordinates, we can afterwards apply the standard techniques in this subgrid, thus obtaining a complexity bound depending on k only.

Lemma 2. An O(t) time computation on a COMMON CRCW PRAM with m memory cells and p processors can also be executed on an m × p mesh in O(t) time without using the local links.

Proof. We devote a processor in the first row of the mesh to each PRAM processor, and a processor in the first column to each memory cell. (So processor (1, 1) is covered twice, which doesn't matter.) W.l.o.g. assume that the PRAM performs local computations, read operations, and write operations in distinct steps, and that a processor is concerned with at most one cell each time (but possibly several processors with the same cell). We have to simulate every PRAM step in constant time on the mesh. This is straightforward in all three cases and can be omitted here. Only note that, in fact, no write conflicts arise on the buses. ∎
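The following Python sketch (our own schematic rendering of the omitted simulation detail, with hypothetical names) illustrates how one PRAM read step runs in constant time under the Lemma 2 mapping: PRAM processor i sits at mesh position (1, i), memory cell j at (j, 1), and three bus phases route each request through the "crossing" processor.

```python
# Schematic of the Lemma 2 simulation: one PRAM read step becomes a
# constant number of bus phases. Buses are modeled as dictionaries
# bus[index] -> message, one message per bus per phase.

def simulate_read_step(memory, wants):
    """memory: dict cell -> value (held by the column-1 processors).
    wants: dict PRAM processor i -> cell it reads this step.
    Returns dict i -> value read, using three constant-time bus phases."""
    # Phase 1: processor (1, i) announces its target cell on column bus i.
    col_bus = {i: c for i, c in wants.items()}
    # Phase 2: cell processor (j, 1) broadcasts cell j's value on row bus j.
    row_bus = {c: memory[c] for c in set(wants.values())}
    # Phase 3: the crossing processor (c, i) heard both phases and
    # forwards the value up column bus i, where (1, i) reads it.
    return {i: row_bus[col_bus[i]] for i in wants}

memory = {2: "x", 5: "y"}
# Two PRAM processors concurrently read cell 2: no bus conflict arises,
# since distinct readers use distinct column buses.
assert simulate_read_step(memory, {0: 2, 1: 2, 4: 5}) == {0: "x", 1: "x", 4: "y"}
```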
Lemma 3. Given an instance with k items, the minimal occupied subgrid can be determined in O(log k) time. That means, within this time we can tell all processors in the minimal occupied subgrid their relative coordinates.

Proof. Each processor containing an item writes some special symbol onto its row bus. Reading from the row buses, each processor in the first column knows whether or not its row contains at least one item. A processor in the first column stores a row indicator iff its row is nonempty. By Lemma 2 we can use Ragde's ordered compaction algorithm to move the q ≤ min{k, √n} row indicators into the first q processors of the leftmost column, preserving their ordering. Note that a √n × √n mesh is available for this, hence we need O(log q / log log √n) = O(log k) time. This is the only time-consuming procedure; all other steps can be done in constant time. Supplying the row indicators with their original row numbers, we thus obtain in the first q processors an ordered list of the occupied rows. Clearly, the position of a row indicator in this list is the row component of the relative coordinates. We now additionally supply the row indicators with these list positions. Next we distribute the row indicators to q different columns using the row buses, and move them back to their original rows through the column buses. Finally, every processor containing a row indicator sends its list position to all processors of its row. Proceed similarly with the columns. Obviously, a processor belongs to the minimal occupied subgrid iff it has received two numbers from the row and column indicators, and these are exactly its relative coordinates. ∎

In the following let r denote the largest relative coordinate value. So we have an occupied subgrid of size r × r, obtained from the minimal occupied subgrid by adding some empty rows or columns. Once the relative coordinates are determined, we could now simply give an O(r) algorithm, adopting parts of the semigroup algorithm from [9]. If our occupied subgrid is completely full (k = r²) then this is O(√k). But in the other extremal case k = r (each occupied row and column contains only one item) this solution would be bad: we only get O(k). In the latter case it is better to ask for the positions of the items in each row by binary search, and to remove and add the items found in this way. But if some rows are fully occupied and others are sparse, both approaches give time bounds with avoidably high exponents. The general difficulty is that applying Lemma 3 yields neither k (which may be any number from r to r²) nor the distribution of the items in our occupied subgrid. Our next idea is to combine the two approaches for the full and the sparse case appropriately. The next three lemmas establish some basic procedures for this.

Lemma 4. Assume all items stand in some r × r subgrid. Then we can transform the given instance in O(w) time into an equivalent instance, contained in an occupied subgrid of size s × s (s ≤ r) and width w.

Proof. Partition the entire mesh into submeshes of size w × w, sum up the items in each submesh in O(w) time, and store the partial sums in the upper left corners of the submeshes. For this we only need the local links, so we can proceed in all submeshes in parallel. Obviously, the number of occupied rows and columns cannot increase. ∎

Lemma 5. If all items in a row stand in r processors with known identification numbers 1, …, r, then we can remove and add q of them in O(q log r) time, using only the row bus.
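For concreteness, here is a sequential Python emulation (our own sketch, assuming the setting of Lemma 4) of the block-compression transformation: each w × w submesh is replaced by the sum of its entries, stored in its upper left corner. This preserves the total sum and leaves the nonzero entries on a grid with spacing w; on the mesh the same effect is achieved in O(w) time using local links only.

```python
# Sequential sketch of the Lemma 4 transformation: partition the mesh
# into w-by-w submeshes and replace each by its block sum, stored at the
# upper-left corner. We only emulate the effect of the data movement.

def compress_blocks(grid, w):
    """Return an equivalent instance: block sums at upper-left corners."""
    n = len(grid)
    out = [[0] * n for _ in range(n)]
    for bi in range(0, n, w):
        for bj in range(0, n, w):
            block_sum = sum(grid[i][j]
                            for i in range(bi, min(bi + w, n))
                            for j in range(bj, min(bj + w, n)))
            out[bi][bj] = block_sum
    return out

grid = [[0, 1, 0, 0],
        [2, 0, 0, 3],
        [0, 0, 0, 0],
        [4, 0, 0, 0]]
new = compress_blocks(grid, 2)
# The total sum is preserved (equivalent instance), and the items now
# lie in rows/columns 0 and 2 only, i.e. on a grid of spacing 2.
assert sum(map(sum, new)) == sum(map(sum, grid)) == 10
assert new[0][0] == 3 and new[0][2] == 3 and new[2][0] == 4
```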
Similar articles
Semigroup and Prefix Computations on Improved Generalized Mesh-Connected Computers with Multiple Buses
Various augmenting mechanisms have been proposed to enhance the communication efficiency of mesh-connected computers (MCC's). One major approach is to add nonconfigurable buses for improved broadcasting. A typical example is the mesh-connected computer with multiple buses (MMB). In this paper, we propose a new class of generalized MMB's, the improved generalized MMB's (IMMB's). Each processor in a...
Full text

Relations Between Several Parallel Computational Models
We investigate the relative computational power of parallel models with shared memory. Based on feasibility considerations present in the literature, we split these models into "lightweight" and "heavyweight," and then find that the heavyweight class is strictly more powerful than the lightweight class, as expected. On the other hand, we contradict the long-held belief that the heavyweight mode...
Full text

Simulation of Meshes with Separable Buses by Meshes with Multiple Partitioned Buses
This paper studies the simulation problem of meshes with separable buses (MSB) by meshes with multiple partitioned buses (MMPB). The MSB and the MMPB are mesh-connected computers enhanced by the addition of broadcasting buses along every row and column. The broadcasting buses of the MSB, called separable buses, can be dynamically sectioned into smaller bus segments by program control, while...
Full text

Geometric approach for optimal routing on mesh with buses
The architecture of 'mesh of buses' is an important model in parallel computing. Its main advantage is that the additional broadcast capability can be used to overcome the main disadvantage of the mesh, namely its relatively large diameter. We show that the addition of buses indeed accelerates routing times. Furthermore, unlike in the 'store and forward' model, the routing time becomes proportion...
Full text

Routing with Locality on Meshes with Buses
… an O(d/f(d))-step, O(f(d))-buffer-size routing algorithm which is asymptotically optimal if f(d) is chosen to be a large constant. In our study, we assume all the processors operate in synchronous MIMD mode. At any time step, each processor can communicate with all of its grid neighbors and can both send and receive one packet along each mesh link. In addition, processors can also store pac...
Full text